Abstract

Folding of protein-like heteropolymers into unique 3D structures is investigated
using Monte Carlo simulations on a cubic lattice. We found that the folding
time of chains of length $N$ scales as $N^{\lambda}$ at the temperature of fastest folding. For
chains with random sequences of monomers $\lambda \approx 6$, and for chains with sequences
designed to provide a pronounced minimum of energy to their ground
state conformation $\lambda \approx 4$. Folding at low temperatures exhibits an Arrhenius-like
behavior with the energy barrier $E_b \approx \phi |E_n|$, where $E_n$ is the energy of
the native conformation; $\phi \approx 0.18$ both for random and designed sequences.
PACS numbers: 87.10.+e, 82.20.Db, 05.70.Fh, 64.60.Cn
The number of all possible conformations (shapes) of a protein of $N$ amino acids scales as
$z^N$, where $z$ is the number of conformations per amino acid residue. Thus, a random search
of the conformational space for the native, biologically active shape would require a folding
time $t \sim z^N$, which is unrealistic for a typical protein with $N \sim 100$ (the Levinthal paradox). In
fact, the real dynamics of a protein is far from a random search. The native structure is believed
to represent a pronounced minimum of energy [1-4], so that the folding process is driven by
a decrease in energy (opposed by a decrease in entropy). Moreover, it is clear that the folding time
should depend strongly on folding conditions. At very high temperature all conformations
are equally favorable and the folding time should indeed be close to Levinthal's estimate $z^N$.
At very low temperature one should expect an Arrhenius-like slowing down caused by energy
barrier(s) on the folding pathway. Thus, there should be some finite temperature optimal for
folding in the sense that folding is fastest at this temperature. Such a temperature dependence
of the folding time has indeed been observed in a number of computer simulations of different
models [5-8].
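
To make the scale of Levinthal's estimate concrete, the short sketch below compares $z^N$ with a power law at $N = 100$. The value z = 3 and the exponent 4 are illustrative assumptions chosen here, not values taken from the simulations.

```python
import math


def levinthal_steps(z: float, n: int) -> float:
    """Number of conformations z**n visited by an exhaustive random search."""
    return float(z) ** n


def power_law_steps(n: int, exponent: float) -> float:
    """Number of steps n**exponent for a power-law folding time."""
    return float(n) ** exponent


if __name__ == "__main__":
    z = 3.0  # assumed number of conformations per residue (illustrative)
    n = 100  # typical protein length quoted in the text
    # Compare orders of magnitude rather than the raw numbers.
    print(f"log10(z^N) = {n * math.log10(z):.1f}")
    print(f"log10(N^4) = {4 * math.log10(n):.1f}")
```

Even for this modest choice of z, the exhaustive search is tens of orders of magnitude slower than any low-order power law.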

Much less is known about the length dependence of the folding time. It is clear that the folding
time should grow as the chain length increases, but the actual dependence is not known.
Though there have been some attempts to estimate the folding time from scaling and other simple
arguments [9-11], these estimates are based on phenomenological, rather than microscopic,
considerations. Experimental data are not readily available to elucidate this important
point. The problem here is that folding can involve kinetic events that are specific
to a given protein, such as isomerisation of prolines and formation of disulphide bonds. In
addition, large proteins typically consist of a few domains that fold, to some extent,
independently.

In the present paper we study the length dependence of the folding rate using Monte Carlo
simulation of folding on a cubic lattice. Despite its simplicity and computational efficiency,
this model has proved useful in elucidating such crucial, experimentally observed features of
protein folding as cooperativity [12-14], the nucleation mechanism [15-17] and parallel pathways.
Both chains with random sequences of amino acids and chains with sequences designed
to provide a pronounced energy minimum for a particular compact conformation || are
analyzed and compared.

The protein chain is modeled as a self-avoiding walk on an infinite cubic lattice ||. The energy
of a given conformation of a chain is given by

$$E = \sum_{i<j} U(\sigma_i, \sigma_j)\,\Delta_{ij} \qquad (1)$$

where $\Delta_{ij} = 1$ if monomers $i$ and $j$ are in contact and $\Delta_{ij} = 0$ otherwise. Monomers $i$ and
$j$ are considered to be in contact if they are separated by one lattice bond and $|i - j| \neq 1$.
The sequence of amino acids is given by $\sigma_i$ with $i = 1, \ldots, N$, where each $\sigma_i$ can take 20 different
values corresponding to the 20 natural amino acids. Finally, $U(a, b)$ is the energy of a contact
between amino acids $a$ and $b$. In this work we use the interaction matrix $U(a, b)$ obtained in Ref.
[19] from a statistical analysis of protein structures. A number of previous studies suggest
[15], [IT| that the results do not depend on the particular choice of interaction matrix. The dynamics
of a chain is modeled by a standard Monte Carlo procedure for a cubic lattice [20].
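
The contact energy of Eq. (1) can be evaluated directly from lattice coordinates. The sketch below is a minimal illustration, not the authors' code: the function names, the two-letter alphabet and the toy matrix `U` are hypothetical, and the Metropolis acceptance rule is an assumption, since the text refers only to a standard Monte Carlo procedure.

```python
import math
import random

# The six nearest-neighbour directions on a cubic lattice.
NEIGHBOURS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def contact_energy(coords, sequence, U):
    """Sum U[a][b] over pairs i < j on adjacent sites with |i - j| != 1."""
    index_at = {site: i for i, site in enumerate(coords)}
    if len(index_at) != len(coords):
        raise ValueError("conformation is not self-avoiding")
    energy = 0.0
    for i, (x, y, z) in enumerate(coords):
        for dx, dy, dz in NEIGHBOURS:
            j = index_at.get((x + dx, y + dy, z + dz))
            # j > i + 1 counts each pair once and skips chain neighbours.
            if j is not None and j > i + 1:
                energy += U[sequence[i]][sequence[j]]
    return energy


def metropolis_accept(delta_e, temperature, rng=random):
    """Assumed acceptance rule: always accept downhill, else exp(-dE/T)."""
    return delta_e <= 0 or rng.random() < math.exp(-delta_e / temperature)


# Toy example: a 4-monomer chain bent into a square, so monomers 0 and 3 touch.
U = {"A": {"A": -1.0, "B": 0.5}, "B": {"A": 0.5, "B": -0.2}}  # hypothetical matrix
square = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]
straight = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)]
print(contact_energy(square, "ABBA", U), contact_energy(straight, "ABBA", U))
```

Storing the occupied sites in a dictionary keeps the energy evaluation linear in chain length and doubles as a self-avoidance check.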

First, we studied folding of chains with random sequences of monomers. Five random
sequences were generated for each chain length in the range from 10 to 50. A long Monte
Carlo simulation was performed for each of these sequences to identify the conformation
with the lowest energy. Fig. 1 shows the lowest energy conformation for one of the random chains
of 40 monomers. In order to make sure that there are no conformations with lower energies,
an additional 10 runs starting from different unfolded conformations were performed for each
sequence. In each of these runs the chain folded to the putative lowest energy conformation,
and no conformation with lower energy was encountered.

From the 5 random sequences generated for the same chain length, we selected the one sequence
whose folding rate was the fastest. Then for this sequence we studied folding over a range of
temperatures to determine the temperature at which the folding rate was fastest. Further, at
this temperature, we made a more precise estimate of the mean first passage time (MFPT)
to the lowest energy conformation by averaging over 50 runs starting from different unfolded
conformations. The MFPT for all studied chain lengths is shown in Fig. 2. It is seen that
the folding time grows with chain length, and the dependence is well described by a power
law, $t \sim N^{\lambda}$ with $\lambda \approx 6$.
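
The exponent of such a power law can be estimated as the slope of a straight-line fit in log-log coordinates. The sketch below is a minimal illustration under that assumption. The data are synthetic, generated from an exact $N^6$ law with a hypothetical prefactor, and are not the simulation results of Fig. 2.

```python
import math


def fit_power_law(lengths, times):
    """Least-squares fit of ln t = ln a + lam * ln N; returns (a, lam)."""
    xs = [math.log(n) for n in lengths]
    ys = [math.log(t) for t in times]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    lam = sxy / sxx                       # slope = scaling exponent
    a = math.exp(mean_y - lam * mean_x)   # intercept = prefactor
    return a, lam


# Synthetic data only: t = 2 * N**6 for chain lengths 10..50.
lengths = [10, 20, 30, 40, 50]
times = [2.0 * n ** 6 for n in lengths]
a, lam = fit_power_law(lengths, times)
print(f"prefactor = {a:.3f}, exponent = {lam:.3f}")
```

With real MFPT data the points scatter around the line, so the fitted exponent carries an uncertainty that this sketch does not estimate.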

Next, we studied the length dependence of the folding rate for sequences designed to have
their native conformations as a pronounced energy minimum. Such sequences are more likely
to exhibit protein-like behavior [12]. To this end, each of the lowest energy conformations
found previously for random sequences was used as a target conformation for sequence
design. The design procedure is described in detail elsewhere [13], [14], ^T|. We did not study
random sequences longer than 50 units, while it is feasible to study the folding rate of designed
sequences up to 100 units long [12]. We therefore generated random compact conformations
to serve as target structures for the design of longer sequences (more than 50 units
in length). An additional condition was applied, that the variation of energies of native contacts be
low enough, in order to eliminate multi-domain behavior for longer sequences [21]. In this
way 5 sequences were generated for each studied chain length, and the sequence with the
fastest folding was selected. The length dependence of the folding time at the optimal temperature
for the designed sequences is presented in Fig. 2 for chain lengths ranging from 10 to 100. It
can be seen that for short chains the folding time of designed sequences is about the same as
that of random sequences. For longer chains, though, folding of designed sequences is much
faster than folding of random sequences. Overall, the length dependence of the folding time for
designed sequences is well described by a power law $t \sim N^{\lambda}$ with $\lambda \approx 4$.

For comparison, we also studied the model introduced by Go and coworkers [1]. In this
model the energy of a given conformation is given by

$$E = -\frac{1}{2}\sum_{i<j} \Delta_{ij}\,\Delta^{N}_{ij} \qquad (2)$$

where $\Delta^{N}_{ij} = 1$ if monomers $i$ and $j$ are in contact in the native conformation and $\Delta^{N}_{ij} = 0$
otherwise. In other words, in the Go model all the native interactions are favorable and
all non-native interactions do not contribute to the energy. In some sense, the Go model
corresponds to ideally designed sequences.
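
In code, the Go energy reduces to counting how many contacts of the current conformation also occur in the native one. The following sketch assumes lattice coordinates stored as integer tuples. The helper names are hypothetical, and the energy per native contact, `epsilon`, is treated as a free parameter rather than as the prefactor of Eq. (2).

```python
# The six nearest-neighbour directions on a cubic lattice.
NEIGHBOURS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def contact_pairs(coords):
    """Set of pairs (i, j), j > i + 1, that occupy adjacent lattice sites."""
    index_at = {site: i for i, site in enumerate(coords)}
    pairs = set()
    for i, (x, y, z) in enumerate(coords):
        for dx, dy, dz in NEIGHBOURS:
            j = index_at.get((x + dx, y + dy, z + dz))
            if j is not None and j > i + 1:
                pairs.add((i, j))
    return pairs


def go_energy(coords, native_pairs, epsilon=1.0):
    """Go-type energy: -epsilon per contact shared with the native state."""
    return -epsilon * len(contact_pairs(coords) & native_pairs)


# Toy example: the native state is a 4-monomer square with one contact (0, 3).
native = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]
stretched = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)]
native_pairs = contact_pairs(native)
print(go_energy(native, native_pairs), go_energy(stretched, native_pairs))
```

Because non-native contacts contribute nothing, the native conformation is the global energy minimum by construction.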



For the Go model the same conformations as for designed sequences were used as native.
The length dependence of the folding time at the optimal temperature for the Go model is presented
in Fig. 2 for chain lengths from 10 to 175. Again, the dependence is fitted very well by a
power law $t \sim N^{\lambda}$ with $\lambda \approx 2.7$.

Fig. 3 shows the dependence of the folding time on inverse temperature for the sequence
designed to fold to the native conformation shown in Fig. 1. It has a clear minimum at some
optimal temperature. All the scaling dependencies for the folding rate discussed above were
obtained at the conditions corresponding, for each sequence, to such an optimal temperature.
What happens at lower temperatures? It is clear from Fig. 3 that at low temperatures
the logarithm of the folding time depends linearly on inverse temperature, exhibiting an Arrhenius-like
behavior $t \sim \exp(E_b/T)$. This suggests that at low temperature folding is an activated
process which entails overcoming an energy barrier $E_b$. Similar dependencies were
obtained for all studied chains, and the corresponding energy barriers $E_b$ were determined for
each sequence from the slopes of the Arrhenius-like branches of these dependencies. The result
is presented in Fig. 4, where $E_b$ is plotted as a function of the energy of the native conformation,
$E_n$. It can be seen that the low-temperature barrier scales linearly with the energy
of the native conformation, with coefficient $\phi \approx 0.18$. Surprisingly, all the data for random
and designed sequences and for the Go model fall on the same line, suggesting a universal
scaling behavior for the low-temperature energy barrier. This observation can be explained
as follows.

The main reason that folding is slow at low temperatures is that the chain gets stuck in
low energy misfolded conformations which need to be unfolded before the chain proceeds
further. In the case of random sequences such trapped misfolded conformations are typically
very different in structure from the native conformation [22]. Nevertheless, the energy of
the misfolded traps is expected to be close to the energy of the native conformation $E_n$ [23].
In the case of designed sequences and the Go model the native state has much lower energy
than any conformation structurally unrelated to the native state. Therefore, for designed
sequences, the low-energy conformations serving as kinetic traps have considerable structural
similarity with the native state. As a result, the energies of the deepest kinetic traps are again
close to $E_n$. In order to escape from such a misfolded state a chain has to break some part of
the misfolded conformation. Our simulations suggest that both random and designed chains
on a cubic lattice have to rearrange a fraction $\phi$ of the whole structure (independently of
its size) in order to escape from a trap. We should note, though, that the actual estimate
of the number of bonds to be broken to escape from the native state may depend on the lattice
and the dynamic algorithm used (e.g. the move set). We cannot exclude that this barrier will be
dramatically diminished in an off-lattice model, so that the low-temperature behavior in an off-lattice
model may be significantly different.
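
The barrier $E_b$ discussed above is the slope of $\ln t$ plotted against $1/T$ on the low-temperature branch, and $\phi$ is its ratio to $|E_n|$. The sketch below shows that extraction on synthetic data. The temperatures, the prefactor, the barrier of 3.6 and the native energy of -20 are all hypothetical values chosen so that the ratio comes out at 0.18.

```python
import math


def arrhenius_barrier(temperatures, times):
    """Slope of ln(t) versus 1/T, i.e. E_b in t ~ exp(E_b / T) with k_B = 1."""
    xs = [1.0 / temp for temp in temperatures]
    ys = [math.log(t) for t in times]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic low-temperature branch: t = t0 * exp(E_b / T).
temperatures = [0.5, 0.6, 0.7, 0.8]   # hypothetical temperatures
true_barrier = 3.6                    # hypothetical barrier
folding_times = [100.0 * math.exp(true_barrier / temp) for temp in temperatures]

e_b = arrhenius_barrier(temperatures, folding_times)
e_native = -20.0                      # hypothetical native-state energy
phi = e_b / abs(e_native)
print(f"E_b = {e_b:.3f}, phi = {phi:.3f}")
```

Only points well below the optimal temperature should enter such a fit, since the dependence is not linear near the minimum of the folding time.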

As we pointed out above, a direct comparison between our main result and experiments
on real proteins is a challenging task for experimentalists. It will require a careful choice
of proteins to study as model systems and of conditions at which their folding is fastest. We
hope that the presented results will stimulate such experiments.

The major finding of this work is that the chain length dependence of the folding rate is
relatively weak (much weaker than the exponential dependence suggested by the Levinthal argument and some
phenomenological estimates ||). This fact is of fundamental importance since it explains,
from the folding perspective, the wide range of lengths of existing proteins (roughly 50-1000
amino acids). An exponential dependence of folding time on protein length would make folding
of longer chains prohibitively slow. Another important application of the present result
is that it can provide a crucial and delicate test of existing and future kinetic theories.
Phenomenological analysis based on the Random Energy Model (REM) predicts an exponential
dependence of the folding rate on system size at all temperatures ||. This is in contrast with
the simulation results and probably points to the inapplicability of the REM (and perhaps of mean-field
models in general) to kinetic issues. This can be understood from the general
perspective that the folding transition is cooperative (first-order like). Such transitions follow
a nucleation mechanism in which the transition state is highly inhomogeneous, consisting of
islands of the "new" phase in a sea of the "old" phase [23]. Such an inhomogeneous distribution
(representing the least-activation path) explains the weak size dependence of the rate of
first-order transitions. It cannot be described by any global homogeneous order parameter.
REM-based phenomenological models ||, or approaches based on Kramers theory [25], use a
global order parameter (the number of native-like contacts, $Q$) as the kinetic reaction coordinate.
Such approaches miss the inhomogeneity of the transition state and thus predict an exponential
size dependence of the protein folding rate.

A possible physical explanation of the power-law rate dependence for designed sequences
is close to the arguments presented recently by Thirumalai [11]. Since folding is a cooperative
(first-order like) process [12], [13], the transition state is reached when a nucleus of the folded
conformation is formed [15-17]. The power-law length scaling of the folding rate implies
that the nucleus does not grow with chain length. In this case the length dependence of the
folding rate arises from the entropic cost of loop closure around the folding nucleus. The free
energy of loop closure depends logarithmically on the loop length [26], and this factor, combined
with the power-law length dependence of the polymer relaxation time [26], apparently translates into
the overall power-law dependence of the folding rate.
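
The argument can be written out as a short scaling estimate. This is only a sketch: the exponents $c$ and $a$ are placeholders introduced here for illustration, not values given in the text, and the loop length is assumed to be of the order of the chain length.

```latex
% Loop-closure free energy grows logarithmically with loop length \ell (k_B = 1):
\Delta F_{\mathrm{loop}} \simeq c\,T \ln \ell
% The corresponding activation factor is a power of \ell, not an exponential:
\exp\!\left(\Delta F_{\mathrm{loop}}/T\right) \simeq \ell^{\,c}
% With a polymer relaxation time \tau \sim N^{a} and \ell \sim N:
t \sim \tau \,\exp\!\left(\Delta F_{\mathrm{loop}}/T\right) \sim N^{a + c}
```

A free energy that grew linearly with $\ell$ would instead give an exponential factor, which is the behavior the simulations argue against.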

The difference between random and designed sequences is crucial. First of all, we see
that designed sequences fold faster at their respective optimal folding temperatures, and the
difference in folding rates between random and designed sequences becomes more pronounced
as chains become longer (due to the different exponents in the scaling laws describing the length
dependencies).

Real proteins must not only reach their native state fast but also be stable in their
native conformations. This points to another crucial difference between random and
designed sequences: designed sequences of different lengths are stable in the native state
at the temperature of their fastest folding, while longer random sequences become less stable at
their corresponding fastest folding temperature (data not shown).

Finally, we note that while power-law scaling fits our data best, the range of chain lengths
which we studied spanned only about an order of magnitude. This length range is most
limited for random sequences; for them we cannot rule out a weak stretched exponential,
rather than power-law, length scaling flT|]. We believe that a microscopic analytical theory of
protein folding dynamics can give a satisfactory answer to the important question of the length
dependence of protein folding time. It is our hope that the presented results will stimulate the
development of such a theory.
Figure Captions

Fig. 1 Lowest energy conformation for a random sequence of 40 amino acid residues.

Fig. 2 Dependence of the folding time t on the chain length N for random sequences 
(black circles), for designed sequences (gray circles), and for the Go model (white circles). 
Folding time was estimated by averaging over 50 folding runs for each of the sequences. 

Fig. 3 Dependence of the folding time $t$ on the temperature $T$ for a random sequence of 40
residues with the ground state shown in Fig. 1. The straight line approximates the low-temperature
part of the dependence. The slope of the line is the energy barrier $E_b$.

Fig. 4 Dependence of the energy barrier $E_b$ on the energy of the native conformation $E_n$
for random sequences (black circles), for designed sequences (gray circles), and for the Go
model (white circles). The line shows the best fit by $E_b = \phi |E_n|$ with $\phi = 0.18$.